Agentic Pipeline
✨️ Overview
Agentic Pipeline is ROLL's core pipeline for agent training and supports multiple algorithms, including PPO and GRPO. It provides the following core advantages:
- Gym-like Environment Definition: Supports various environment types, including FrozenLake, Sokoban, etc., and can easily extend custom environments through gym-like interfaces.
- Rich Learning Granularity: Supports both trajectory-wise (StarPO) and step-wise (GiGPO) training granularities.
- Asynchronous Parallel Rollout at Environment Granularity: Independent trajectory sampling across environments improves sampling efficiency.
- Asynchronous Training: Rollout and training are decoupled, enabling asynchronous training.
- Local Debugging of Multi-turn Interaction: Multi-turn interaction rollouts can be debugged locally, improving development efficiency for multi-turn interaction applications.
- Flexible Policy Configuration: Supports multiple distributed training and inference strategies, such as Megatron, DeepSpeed, and vLLM, allowing flexible configuration based on available hardware resources.
✨️ Core Components
Main Module (AgenticPipeline)
AgenticPipeline (located at roll/pipeline/agentic/agentic_pipeline.py) is the main process for the entire agent training. It manages the complete training workflow, including:
- Initializing and managing distributed worker processes (Actor, Critic, Reference, etc.).
- Coordinating environment interaction and data collection.
- Executing model training steps.
- Handling checkpoint saving.
- Recording metrics and experiment tracking.
Source Code: roll/pipeline/agentic/agentic_pipeline.py
Configuration File (AgenticConfig)
AgenticConfig (defined in roll/pipeline/agentic/agentic_config.py) is a configuration object based on Pydantic/dataclass used to specify all parameters for running AgenticPipeline. This configuration system supports YAML file configuration and uses the Hydra framework for management.
For a description of the configuration system, see config_system.
Configuration Structure and Organization
Configuration files (such as examples/qwen2.5-0.5B-agentic/agent_val_frozen_lake.yaml) are organized by functional module and mainly include the following sections (illustrative YAML sketches for these sections follow the list):
- Basic Experiment Settings
  - exp_name: Experiment name, used to identify a specific training task
  - seed: Random seed to ensure reproducible experiments
  - logging_dir: Path to save log files
  - output_dir: Path to save model checkpoints and output files
  - render_save_dir: Path to save rendered frames (for environment visualization)
- Training Control Parameters
  - max_steps: Maximum number of training steps
  - save_steps: Frequency of saving model checkpoints
  - logging_steps: Frequency of recording training metrics
  - eval_steps: Frequency of running validation evaluation
  - resume_from_checkpoint: Whether to resume training from a checkpoint. Set it to the checkpoint path to continue training; otherwise set it to False.
- Model Configuration
  - pretrain: Path to the pretrained model
  - reward_pretrain: Path to the reward model's pretrained weights
- Algorithm Parameters
  - adv_estimator: Advantage estimator type (such as gae, grpo, reinforce)
  - ppo_epochs: Number of optimization epochs per batch of samples
  - gamma: Discount factor for computing returns
  - lambd: Lambda parameter in GAE
  - pg_clip: Clipping range for the PPO policy gradient loss
  - init_kl_coef: Initial coefficient of the KL penalty
  - target_kl: Target KL value for adaptive KL control
  - whiten_advantages: Whether to whiten advantages
  - entropy_loss_coef: Coefficient of the entropy loss
- Worker Process Configuration: each worker process (actor_train, actor_infer, critic, reference) configuration includes:
  - Model Parameters (model_args)
    - model_type: Model type (such as causal_lm)
    - dtype: Computation precision (such as bf16, fp16)
    - attn_implementation: Attention implementation (such as fa2)
    - disable_gradient_checkpointing: Whether to disable gradient checkpointing
  - Training Parameters (training_args)
    - learning_rate: Learning rate
    - per_device_train_batch_size: Training batch size per device
    - gradient_accumulation_steps: Number of gradient accumulation steps
    - weight_decay: Weight decay coefficient
    - warmup_steps: Number of learning rate warmup steps
    - lr_scheduler_type: Learning rate scheduler type
  - Generation Parameters (generating_args)
    - max_new_tokens: Maximum number of new tokens to generate
    - top_p: Nucleus sampling parameter
    - temperature: Sampling temperature
    - num_return_sequences: Number of returned sequences
  - Distributed Strategy (strategy_args)
    - strategy_name: Distributed strategy to use (such as megatron_train, vllm, hf_infer)
    - Strategy-specific parameters, such as tp_size (tensor parallel size) and pp_size (pipeline parallel size)
    - gpu_memory_utilization: GPU memory utilization (specific to vLLM)
  - Device Mapping (device_mapping)
    - Specifies which GPU devices the worker process should use
- Environment Manager Configuration
  - train_env_manager: Training environment manager configuration
  - val_env_manager: Validation environment manager configuration
  - Environment-related parameters:
    - num_env_groups: Number of environment groups
    - group_size: Number of environments per group
    - tags: List of environment tags
    - num_groups_partition: Group allocation for each environment type
    - max_env_num_per_worker: Maximum number of environments per worker
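To make the top-level layout concrete, here is a minimal, illustrative YAML sketch of the basic experiment, training control, model, and algorithm sections. The field names are the ones listed above; the values are placeholders (not recommended settings), and the exact nesting may differ from examples/qwen2.5-0.5B-agentic/agent_val_frozen_lake.yaml.

```yaml
# Illustrative sketch only: field names follow the list above, values are placeholders.
exp_name: agentic_frozen_lake_demo
seed: 42
logging_dir: ./output/logs
output_dir: ./output/checkpoints
render_save_dir: ./output/render

max_steps: 1024
save_steps: 100
logging_steps: 1
eval_steps: 10
resume_from_checkpoint: false   # or a checkpoint path to continue training

pretrain: Qwen/Qwen2.5-0.5B-Instruct        # placeholder model path
reward_pretrain: Qwen/Qwen2.5-0.5B-Instruct # placeholder reward model path

adv_estimator: grpo
ppo_epochs: 1
gamma: 1.0
lambd: 1.0
pg_clip: 0.2
init_kl_coef: 0.0
target_kl: 0.1
whiten_advantages: true
entropy_loss_coef: 0.0
```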
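Next, a sketch of a single worker process block, using actor_train as an example. The nesting mirrors the parameter groups above (model_args, training_args, generating_args, strategy_args, device_mapping); the values, and details such as the exact device_mapping format, are assumptions for illustration rather than settings taken from the real example config.

```yaml
# Illustrative worker block; all values are placeholder assumptions.
actor_train:
  model_args:
    model_type: causal_lm
    dtype: bf16
    attn_implementation: fa2
    disable_gradient_checkpointing: false
  training_args:
    learning_rate: 1.0e-6
    per_device_train_batch_size: 1
    gradient_accumulation_steps: 16
    weight_decay: 0.0
    warmup_steps: 10
    lr_scheduler_type: constant
  generating_args:
    max_new_tokens: 128
    top_p: 1.0
    temperature: 1.0
    num_return_sequences: 1
  strategy_args:
    strategy_name: megatron_train
    tp_size: 1        # tensor parallel size (strategy-specific)
    pp_size: 1        # pipeline parallel size (strategy-specific)
  device_mapping: [0, 1, 2, 3]   # GPUs used by this worker; exact format may differ
```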
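Similarly, a sketch of the environment manager sections. Field names come from the list above; the grouping and values are placeholder assumptions.

```yaml
# Illustrative environment manager blocks; values are placeholders.
train_env_manager:
  num_env_groups: 8           # number of environment groups
  group_size: 8               # environments per group
  max_env_num_per_worker: 4   # cap on environments hosted by a single worker
  tags: [FrozenLake]          # environment tags to sample from
  num_groups_partition: [8]   # group allocation per environment type
val_env_manager:
  num_env_groups: 8
  group_size: 1
  tags: [FrozenLake]
  num_groups_partition: [8]
```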
✨️ Environment Preparation
Environment Types
Agentic Pipeline supports various environment types, including but not limited to:
- FrozenLake: Classic reinforcement learning environment where the agent must find a path across the ice to the goal.
- Sokoban: Box-pushing game environment where the agent needs to push boxes to designated positions.
- WebShop: Simulated online shopping environment where the agent needs to find suitable products based on user requirements.
- Additional environments can be added through the gym-like interface.
Environment Configuration
In the configuration file, custom environments are defined through the custom_envs field. Each environment configuration includes:
- env_type: Environment type
- env_config: Environment-specific configuration parameters
- max_tokens_per_step: Maximum number of tokens per step
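As a rough illustration, a custom_envs entry might look like the sketch below. The entry name, the env_type value, and the keys inside env_config are hypothetical placeholders; the actual fields depend on the environment being registered.

```yaml
# Hypothetical custom environment entry; keys inside env_config depend on the environment.
custom_envs:
  MyFrozenLake:
    env_type: frozen_lake        # environment type (placeholder)
    max_tokens_per_step: 128     # maximum tokens generated per step
    env_config:                  # environment-specific parameters (illustrative)
      map_size: 4
      is_slippery: false
```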